The issue of model selection has drawn the attention of both applied and
theoretical statisticians for a long time. Indeed, there has been an enormous
range of contributions in model selection proposals, including work by
Akaike (1973), Mallows (1973), Foster and George (1994), Birge and Massart
(2001a) and Abramovich, Benjamini, Donoho and Johnstone (2000).
Over the last decade, modern computer-driven methods have been developed,
such as All Subsets, Forward Selection, Forward Stagewise or Lasso.

Such methods are useful in the setting of the standard linear model, where
we observe noisy data and wish to predict the response variable using only
a few covariates, since they automatically provide linear models that fit the
data. The procedure described in this paper is, on the one hand, numerically
very efficient and, on the other hand, very general, since, with slight
modifications, it enables us to recover the estimates given by the Lasso and
Stagewise.

1. Estimation procedure. The "LARS" method is based on a recursive
procedure selecting, at each step, the covariates having largest absolute
correlation with the response $y$. In the case of an orthogonal design, the
estimates can then be viewed as an $\ell^1$-penalized estimator. Consider the
linear regression model where we observe $y$ with some random noise
$\varepsilon$, with orthogonal design assumptions:

$$y = X\beta + \varepsilon.$$


Using the soft-thresholding form of the estimator, we can write it, equivalently,
as the minimizer of an ordinary least squares term plus an $\ell^1$ penalty over
the coefficients of the regression. As a matter of fact, at step
$k = 1, \ldots, m$, the estimators $\hat\beta_k = X^{-1}\hat\mu_k$ are given by

$$\hat\mu_k = \arg\min_{\mu \in \mathbb{R}^n} \left( \|y - \mu\|_n^2 + 2\lambda_n^2(k)\|\mu\|_1 \right).$$

There is a trade-off between the two terms, balanced by the decreasing
smoothing sequence $\lambda_n^2(k)$. The more stress is laid on the penalty, the
more parsimonious the representation will be. The choice of the $\ell^1$ penalty
enables us to keep the largest coefficients, while the smallest ones shrink
toward zero in a soft-thresholding scheme. This point of view is investigated in
the Lasso algorithm as well as in studying the false discovery rate (FDR).
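
The link between the $\ell^1$ penalty and soft thresholding can be checked
numerically, coordinate by coordinate. The sketch below is only an
illustration under simplifying assumptions: it uses an unnormalized squared
error, a generic threshold `t` standing in for the level $\lambda_n^2(k)$, and
hypothetical helper names (`soft_threshold`, `penalized_objective`,
`grid_minimizer`) that do not come from the paper. It compares the closed-form
soft-thresholded value with a brute-force grid minimization of
$(y - \mu)^2 + 2t|\mu|$.

```python
def soft_threshold(x, t):
    """Soft thresholding: shrink x toward zero by t, and set it to 0 if |x| <= t."""
    if abs(x) <= t:
        return 0.0
    return x - t if x > 0 else x + t


def penalized_objective(mu, y, t):
    """One-coordinate criterion: squared error plus an l1 penalty with weight 2t."""
    return (y - mu) ** 2 + 2.0 * t * abs(mu)


def grid_minimizer(y, t, lo=-5.0, hi=5.0, steps=20001):
    """Brute-force minimizer of the criterion over a regular grid on [lo, hi]."""
    best_mu = lo
    best_val = penalized_objective(lo, y, t)
    for i in range(steps):
        mu = lo + (hi - lo) * i / (steps - 1)
        val = penalized_objective(mu, y, t)
        if val < best_val:
            best_mu, best_val = mu, val
    return best_mu


# Illustrative data (assumed values, not taken from the paper).
y_values = [3.0, -1.2, 0.4, -2.5]
t = 1.0

closed_form = [soft_threshold(y, t) for y in y_values]
brute_force = [grid_minimizer(y, t) for y in y_values]

for y, a, b in zip(y_values, closed_form, brute_force):
    print(f"y = {y:5.2f}   soft threshold = {a:7.4f}   grid minimizer = {b:7.4f}")
```

Large observations are shifted toward zero by `t`, while observations below
the threshold are set exactly to zero, which is the parsimony effect described
above.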

So, choosing these weights in an optimal way determines the form of the
estimator as well as its asymptotic behavior. In the case of the algorithmic
procedure, the suggested level is the $(k+1)$th order statistic:

$$\lambda_n^2(k) = |y|_{(k+1)}.$$

As a result, it seems possible to study the asymptotic behavior of the LARS
estimates under some conditions on the coefficients of $\beta$. For instance, if
there exists a roughness parameter $\rho \in [0,2]$, such that
$\sum_{j=1}^{m} |\beta_j|^\rho \le M$, metric
entropy theory results lead to an upper bound for the mean square error
$\|\hat\beta - \beta\|^2$. Here we refer to the results obtained in Loubes and
van de Geer (2002). Consistency should be followed by the asymptotic
distribution, as is done for the Lasso in Knight and Fu (2000).

The interest of such an investigation is twofold: first, it gives some insight
into the properties of such estimators. Second, it suggests an approach for
choosing the threshold $\lambda_n$, which can justify the empirical
cross-validation method, developed later in the paper. Moreover, the asymptotic
distributions of the estimators are needed for inference.

Other choices of penalty and loss functions can also be considered. First,
for $\gamma \in (0, 1]$, consider the penalty

$$\sum_{j=1}^{m} |\beta_j|^\gamma.$$

If $\gamma < 1$, the penalty is not convex anymore, but there exist algorithms to
solve the minimization problem. Constraints on the $\ell^\gamma$ norm of the
coefficients are equivalent to lacunarity assumptions and may make estimation
easier for sparse signals, which often arise with high-dimensional data.

Moreover, replacing the quadratic loss function with an $\ell^1$ loss gives rise
to a robust estimator, the penalized absolute deviation of the form

$$\hat\mu_k = \arg\min_{\mu \in \mathbb{R}^n} \left( \|y - \mu\|_{n,1} + 2\lambda_n^2(k)\|\mu\|_1 \right).$$



Hence, with these estimates, it is possible to get rid of the problem of
variance estimation for the model. Their asymptotic behavior can be derived
from Loubes and van de Geer (2002), in the regression framework.

Finally, a penalty over both the number of coefficients and the smoothness
of the coefficients can be used to study, from a theoretical point of view,
the asymptotics of the estimate. Such a penalty is analogous to complexity
penalties studied in van de Geer (2001):

$$\hat\mu^* = \arg\min_{\mu \in \mathbb{R}^n,\ k \in [1,m]} \left( \|y - \mu\|_n^2 + 2\lambda_n^2(k)\|\mu\|_1 + \mathrm{pen}(k) \right).$$



2. Mallows' $C_p$. We now discuss the crucial issue of selecting the number
$k$ of influential variables. To make this discussion clear, let us first assume
the variance $\sigma^2$ of the regression errors to be known. Interestingly, the
penalized criterion which is proposed by the authors is exactly equivalent to
Mallows' $C_p$ when the design is orthogonal (this is indeed the meaning of
their Theorem 3). More precisely, using the same notation as in the paper,
let us focus on the following situation which illustrates what happens in the
orthogonal case where LARS is equivalent to the Lasso. One observes some
random vector $y$ in $\mathbb{R}^n$, with expectation $\mu$ and covariance matrix
$\sigma^2 I_n$. The variable selection problem that we want to solve here is to
determine which components of $y$ are influential. According to Lemma 1, given
$k$, the $k$th LARS estimate $\hat\mu_k$ of $\mu$ can be explicitly computed as a
soft-thresholding estimator. Indeed, considering the order statistics of the
absolute values of the data denoted by

$$|y|_{(1)} \ge |y|_{(2)} \ge \cdots \ge |y|_{(n)}$$

and defining the soft threshold function $\eta(\cdot, t)$ with level $t \ge 0$ as

$$\eta(x,t) = x\,\mathbf{1}_{|x|>t}\left(1 - \frac{t}{|x|}\right),$$



one has

$$\hat\mu_{k,i} = \eta(y_i, |y|_{(k+1)}).$$

To select $k$, the authors propose to minimize the $C_p$ criterion

(2.1)    $C_p(\hat\mu_k) = \|y - \hat\mu_k\|^2 - n\sigma^2 + 2k\sigma^2.$
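
To make the selection rule concrete, here is a minimal sketch of how
criterion (2.1) could be evaluated in the orthogonal setting. It is an
illustration only: the function names (`soft_threshold_estimate`, `cp`,
`select_k`), the example data and the restriction of $k$ to $0, \ldots, n-1$
(so that $|y|_{(k+1)}$ is always defined) are assumptions made here, not part
of the paper.

```python
def soft_threshold(x, t):
    """eta(x, t): shrink x toward zero by t, and return 0 when |x| <= t."""
    if abs(x) <= t:
        return 0.0
    return x - t if x > 0 else x + t


def soft_threshold_estimate(y, k):
    """k-th estimate: soft-threshold y at the (k+1)-th largest absolute value."""
    abs_sorted = sorted((abs(v) for v in y), reverse=True)
    t = abs_sorted[k]  # |y|_(k+1), using 0-based indexing
    return [soft_threshold(v, t) for v in y]


def cp(y, k, sigma2):
    """Criterion (2.1): residual sum of squares - n*sigma^2 + 2*k*sigma^2."""
    mu_hat = soft_threshold_estimate(y, k)
    rss = sum((a - b) ** 2 for a, b in zip(y, mu_hat))
    return rss - len(y) * sigma2 + 2 * k * sigma2


def select_k(y, sigma2):
    """Return the k in {0, ..., n-1} minimizing the criterion."""
    return min(range(len(y)), key=lambda k: cp(y, k, sigma2))


# Assumed toy data: two large components plus small "noise-like" values.
y_example = [10.0, -8.0, 0.3, -0.5, 0.1, 0.7, -0.2, 0.4]
k_hat = select_k(y_example, sigma2=1.0)
print("selected k:", k_hat)
print("criterion values:", [round(cp(y_example, k, 1.0), 2) for k in range(len(y_example))])
```

On this toy vector the criterion is minimized at the number of clearly large
components, because each further reduction of the residual sum of squares is
smaller than the $2\sigma^2$ added per extra variable.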

Our purpose is to analyze this proposal with the help of the results on 
penalized model selection criteria proved in Birge and Massart (2001a, b). In 
these papers some oracle type inequalities are proved for selection procedures 
among some arbitrary collection of projection estimators on linear models 
when the regression errors are Gaussian. In particular one can apply them 
to the variable subset selection problem above, assuming the random vector 
y to be Gaussian. If one decides to penalize in the same way the subsets of 
variables with the same cardinality, then the penalized criteria studied in 
Birge and Massart (2001a, b) take the form 

(2.2)    $C'(\tilde\mu_k) = \|y - \tilde\mu_k\|^2 - n\sigma^2 + \mathrm{pen}(k),$

where $\mathrm{pen}(k)$ is some penalty to be chosen and $\tilde\mu_k$ denotes the
hard threshold estimator with components

$$\tilde\mu_{k,i} = \eta'(y_i, |y|_{(k+1)}),$$

where

$$\eta'(x,t) = x\,\mathbf{1}_{|x|>t}.$$

The essence of the results proved by Birge and Massart (2001a, b) in this
case is the following. Their analysis covers penalties of the form

$$\mathrm{pen}(k) = 2k\sigma^2 C\left(\log\left(\frac{n}{k}\right) + C'\right)$$

[note that the FDR penalty proposed in Abramovich, Benjamini, Donoho and Johnstone
(2000) corresponds to the case $C = 1$]. It is proved in Birge and Massart
(2001a) that if the penalty $\mathrm{pen}(k)$ is heavy enough (i.e., $C > 1$ and
$C'$ is an adequate absolute constant), then the model selection procedure works
in the sense that, up to a constant, the selected estimator
$\tilde\mu_{\hat k}$ performs as well as the best estimator among the collection
$\{\tilde\mu_k, 1 \le k \le n\}$ in terms of quadratic risk. On the contrary, it
is proved in Birge and Massart (2001b) that if $C < 1$, then at least
asymptotically, whatever $C'$, the model selection does not work, in the sense
that, even if $\mu = 0$, the procedure will systematically choose large values
of $k$, leading to a suboptimal order for the quadratic risk of the selected
estimator $\tilde\mu_{\hat k}$. So, to summarize, some $2k\sigma^2\log(n/k)$ term
should be present in the penalty, in order to make the model selection
criterion (2.2) work. In particular, the choice $\mathrm{pen}(k) = 2k\sigma^2$ is
not appropriate, which means that Mallows' $C_p$ does not work in this context.
At first glance, these results seem to indicate that some problems could occur
with the use of the Mallows' $C_p$ criterion (2.1). Fortunately, however, this
is not at all the case because a very interesting phenomenon occurs, due to the
soft-thresholding effect. As a matter of fact, if we compare the residual sums
of squares of the soft threshold estimator $\hat\mu_k$ and the hard threshold
estimator $\tilde\mu_k$, we easily get

$$\|y - \hat\mu_k\|^2 - \|y - \tilde\mu_k\|^2 = \sum_{i=1}^{n} |y|_{(k+1)}^2 \mathbf{1}_{|y_i| > |y|_{(k+1)}} = k|y|_{(k+1)}^2,$$

so that the "soft" $C_p$ criterion (2.1) can be interpreted as a "hard" criterion
(2.2) with random penalty

(2.3)    $\mathrm{pen}(k) = k|y|_{(k+1)}^2 + 2k\sigma^2.$
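
The identity behind the random penalty (2.3) is easy to verify numerically.
The following sketch, with hypothetical helper names and simulated standard
Gaussian data chosen here purely for illustration, computes both residual sums
of squares at the threshold $|y|_{(k+1)}$ and compares their difference with
$k|y|_{(k+1)}^2$. It assumes no ties among the $|y_i|$, which holds almost
surely for continuous data.

```python
import random


def soft_threshold(x, t):
    """Soft thresholding: shrink x toward zero by t, 0 when |x| <= t."""
    if abs(x) <= t:
        return 0.0
    return x - t if x > 0 else x + t


def hard_threshold(x, t):
    """Hard thresholding: keep x when |x| > t, otherwise 0."""
    return x if abs(x) > t else 0.0


def rss_gap(y, k):
    """Return (soft RSS - hard RSS, k * |y|_(k+1)^2) at threshold |y|_(k+1)."""
    t = sorted((abs(v) for v in y), reverse=True)[k]
    rss_soft = sum((v - soft_threshold(v, t)) ** 2 for v in y)
    rss_hard = sum((v - hard_threshold(v, t)) ** 2 for v in y)
    return rss_soft - rss_hard, k * t * t


# Simulated pure-noise data (an assumption made for this illustration).
rng = random.Random(0)
y = [rng.gauss(0.0, 1.0) for _ in range(50)]

gaps = [rss_gap(y, k) for k in range(len(y))]
max_error = max(abs(lhs - rhs) for lhs, rhs in gaps)
print("largest discrepancy over all k:", max_error)
```

Each of the $k$ components above the threshold contributes exactly
$|y|_{(k+1)}^2$ to the soft residual and nothing to the hard one, while the
remaining components contribute equally to both, which is why the two sides
agree up to rounding error.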

Of course, strictly speaking, this kind of penalty escapes the analysis of
Birge and Massart (2001a, b) as described above, since the penalty is not
deterministic. However, it is quite easy to realize that, in this penalty,
$|y|_{(k+1)}^2$ plays the role of the apparently "missing" logarithmic factor
$2\sigma^2\log(n/k)$. Indeed, let us consider the pure noise situation where
$\mu = 0$ to keep the calculations as simple as possible. Then, if we consider
the order statistics of a sample $U_1, \ldots, U_n$ of the uniform distribution
on [0, 1]

$$U_{(1)} \le U_{(2)} \le \cdots \le U_{(n)},$$

taking care of the fact that these statistics are taken according to the usual
increasing order while the order statistics on the data are taken according
to the reverse order, $\sigma^{-2}|y|_{(k+1)}^2$ has the same distribution as

$$Q^{-1}(U_{(k+1)}),$$

where $Q$ denotes the tail function of the chi-square distribution with 1
degree of freedom. Now using the double approximations
$Q^{-1}(u) \sim 2\log(1/u)$ as $u$ goes to 0 and $U_{(k+1)} \approx (k+1)/n$
(which at least means that, given $k$, $nU_{(k+1)}$ tends to $k + 1$ almost
surely as $n$ goes to infinity but can also be expressed with much more precise
probability bounds) we derive that
$|y|_{(k+1)}^2 \sim 2\sigma^2\log(n/(k+1))$. The conclusion is that it is
possible to interpret the "soft" $C_p$ criterion (2.1) as a randomly penalized
"hard" criterion (2.2). The random part of the penalty $k|y|_{(k+1)}^2$ cleverly
plays the role of the unavoidable logarithmic term $2\sigma^2 k\log(n/k)$,
allowing the hope that the usual $2k\sigma^2$ term will be heavy enough to make
the selection procedure work
as we believe it does. A very interesting feature of the penalty (2.3) is that 
its random part depends neither on the scale parameter $\sigma^2$ nor on the tail
of the errors. This means that one could think of adapting the data-driven
strategy proposed in Birge and Massart (2001b), which chooses the penalty
without knowing the scale parameter, to this context, even if the errors are not
Gaussian. This would lead to the following heuristics. For large values of
$k$, one can expect the quantity $-\|y - \hat\mu_k\|^2$ to behave as an affine
function of $k$ with slope $\alpha(n)\sigma^2$. If one is able to compute
$\alpha(n)$, either theoretically or numerically (our guess is that it varies
slowly with $n$ and that it is close to 1.5), then one can just estimate the
slope (for instance by making a regression of $-\|y - \hat\mu_k\|^2$ with respect
to $k$ for large enough values of $k$) and plug the resulting estimate of
$\sigma^2$ into (2.1). Of course, some more effort would be required to complete
this analysis and provide rigorous oracle inequalities in the spirit of those
given in Birge and Massart (2001a, b) or Abramovich, Benjamini, Donoho and
Johnstone (2000) and also some simulations to check whether our proposal to
estimate $\sigma^2$ is valid or not.
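
The first-order approximation of $Q^{-1}(u)$ by $2\log(1/u)$ used in the
argument above can be examined with a few lines of code. The sketch below
relies on the fact that, for a chi-square variable with 1 degree of freedom,
$Q^{-1}(u) = (\Phi^{-1}(1 - u/2))^2$ with $\Phi$ the standard normal
distribution function. The helper names and the chosen values of $u$ are
assumptions made for illustration. The comparison suggests that the
approximation holds only at the logarithmic first order, and that the ratio
approaches 1 slowly as $u$ decreases.

```python
import math
from statistics import NormalDist


def chi2_1_tail_inverse(u):
    """Q^{-1}(u) for chi-square(1): the q with P(Z^2 > q) = u, Z standard normal."""
    z = NormalDist().inv_cdf(1.0 - u / 2.0)
    return z * z


def log_approximation(u):
    """First-order approximation 2 * log(1/u) of Q^{-1}(u) for small u."""
    return 2.0 * math.log(1.0 / u)


# u plays the role of (k + 1) / n; smaller u means a more extreme order statistic.
u_values = (1e-2, 1e-4, 1e-6)
ratios = [chi2_1_tail_inverse(u) / log_approximation(u) for u in u_values]

for u, r in zip(u_values, ratios):
    print(f"u = {u:g}   Q^-1(u) / (2 log(1/u)) = {r:.3f}")
```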

Our purpose here was just to mention some possible explorations starting
from the present paper, which we have found very stimulating. It seems to us
that it solves practical questions of crucial interest and raises very
interesting theoretical questions: consistency of the LARS estimator;
efficiency of Mallows' $C_p$ in this context; use of random penalties in model
selection for more general frameworks.